Abstract

In this paper, we present a novel approach for medical
synonym extraction. We aim to integrate the term embedding
with the medical domain knowledge for healthcare
applications. One advantage of our method is that it is very
scalable. Experiments on a dataset with more than 1M term
pairs show that the proposed approach outperforms the
baseline approaches by a large margin.

1 Introduction 

Many components of a high-quality natural language
processing system rely on synonym extraction. Examples
include query expansion, text summarization [Barzilay and
Elhadad, 1999], question answering [Ferrucci, 2012], and
paraphrase detection. Although the value of synonym
extraction is undisputed, manual construction of such
resources is always expensive, leading to low knowledgebase
(KB) coverage [Henriksson et al., 2014].

In the medical domain, this KB coverage issue is more
serious, since the language use variability is exceptionally
high [Meystre et al., 2008]. In addition, the natural language
content in the medical domain is growing at an extremely
high speed, making it hard for people to understand it and
to update the knowledgebase in a timely manner.

To construct a large-scale medical synonym extraction
system, the main challenge is how to build a system that can
automatically combine the existing manually extracted
medical knowledge with the huge amount of knowledge buried
in unstructured text. In this paper, we construct a medical
corpus containing 130M sentences (20 gigabytes of pure text).
We also construct a semi-supervised framework to generate a
vector representation for each medical term in this corpus.
Our framework extends the Word2Vec model [Mikolov et al.,
2013] by integrating the existing medical knowledge in the
model training process.

To model the concept of synonym, we build a “concept
space” that contains both the semi-supervised term embedding
features and the expanded features that capture the
similarity of two terms on both the word embedding space and
the surface form. We then apply a linear classifier directly
to this space for synonym extraction. Since both the manually
extracted medical knowledge and the knowledge buried in the
unstructured text have been encoded in the concept space, a
cheap classifier can produce satisfying extraction results,
making it possible to efficiently process a huge number of
term pairs.

Our system is designed in such a way that both the existing
medical knowledge and the context in the unstructured text
are used in the training process. The system can be directly
applied to the input term pairs without considering the
context. The overall contributions of this paper on medical
synonym extraction are two-fold:

• From the perspective of applications, we identify
a number of Unified Medical Language System
(UMLS) [Lindberg et al., 1993] relations that can be
mapped to the synonym relation (Table 1), and present
an automatic approach to collect a large amount of
training and test data for this application. We also apply
our model to a set of 11B medical term pairs, resulting in
a new medical synonym knowledgebase with more than
3M synonym candidates unseen in the previous medical
resources.

• From the perspective of methodologies, we present a
semi-supervised term embedding approach that can train
the vector space model using both the existing medical
domain knowledge and the text data in a large corpus.
We also expand the term embedding features to form a
concept space, and use it to facilitate synonym
extraction.

The experimental results show that our synonym extraction
models are fast and outperform the state-of-the-art
approaches on medical synonym extraction by a large margin.
The resulting synonym KB can also be used as a complement
to the existing knowledgebases in information extraction
tasks.

2 Related Work 

A wide range of techniques has been applied to synonym
detection, including the use of lexicosyntactic patterns
[Hearst, 1992], clustering [Brown et al., 1992], graph-based
models [Blondel et al., 2004][Nakayama et al., 2007][Wang and
Hirst, 2009][Curran, 2008] and distributional semantics
[Dumais and Landauer, 1997][Henriksson et al., 2014]
[Henriksson et al., 2013a][Henriksson et al., 2013b][Zeng et
al., 2012]. There are also efforts to improve the detection
performance using multiple sources or ensemble methods
[Curran, 2002][Wu and Zhou, 2003][Peirsman and Geeraerts,
2009].

The vector space models are directly related to synonym
extraction. Some approaches use the low rank approximation
idea to decompose large matrices that capture the statistical
information of the corpus. The most representative method
under this category is Latent Semantic Analysis (LSA)
[Deerwester et al., 1990]. Some new models also follow this
approach, like Hellinger PCA [Lebret and Collobert, 2014] and
GloVe [Pennington et al., 2014].

Neural network based representation learning has attracted
a lot of attention recently. One of the earliest works is
[Rumelhart et al., 1986]. This idea was then applied to
language modeling [Bengio et al., 2003], which motivated
a number of research projects in machine learning to
construct vector representations for natural language
processing tasks [Collobert and Weston, 2008][Turney and
Pantel, 2010][Glorot et al., 2011][Socher et al., 2011][Mnih
and Teh, 2012][Passos et al., 2014].

Following the same neural network language modeling
idea, Word2Vec [Mikolov et al., 2013] significantly simplifies
the previous models, and has become one of the most efficient
approaches to learn word embeddings. In Word2Vec, there are
two ways to generate the “input-desired output” pairs from
the context: “SkipGram” (predicts the surrounding words
given the current word) and “CBOW” (predicts the current
word based on the surrounding words), and two approaches
to simplify the training: “Negative Sampling” (the vocabulary
is represented in one hot representation, and the algorithm
only takes a number of randomly sampled “negative” examples
into consideration at each training step) and “Hierarchical
SoftMax” (the vocabulary is represented as a Huffman binary
tree). So one can train a Word2Vec model from the input
under 4 different settings, like “SkipGram” + “Negative
Sampling”. Word2Vec is the basis of our semi-supervised word
embedding model, and we will discuss it in more detail in
Section 4.1.
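To make the difference between the two pair-generation schemes concrete, the following minimal Python sketch builds the “input-desired output” pairs for a single tokenized sentence. The function name `training_pairs`, its parameters, and the example sentence are illustrative assumptions and are not taken from the Word2Vec implementation.

```python
def training_pairs(tokens, window=2, mode="skipgram"):
    """Return (input, desired_output) pairs for one tokenized sentence.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for i, current in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if mode == "skipgram":
            # The current word predicts each surrounding word.
            pairs.extend((current, c) for c in context)
        elif mode == "cbow":
            # The surrounding words jointly predict the current word.
            pairs.append((tuple(context), current))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return pairs


sentence = ["blood", "in", "stool"]
skipgram_pairs = training_pairs(sentence, window=1, mode="skipgram")
cbow_pairs = training_pairs(sentence, window=1, mode="cbow")
```

Either kind of pair can then be trained with Negative Sampling or Hierarchical SoftMax, which gives the 4 settings mentioned above.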

3 Medical Corpus, Concepts and Synonyms 

3.1 Medical Corpus 

Our medical corpus has incorporated a set of Wikipedia
articles and MEDLINE abstracts (2013 version)¹. We also
complemented these sources with around 20 medical journals
and books like Merck Manual of Diagnosis and Therapy. In
total, the corpus contains about 130M sentences (about 20G
pure text), and about 15M distinct terms in the vocabulary
set.

3.2 Medical Concepts 

A significant amount of the medical knowledge has already
been stored in the Unified Medical Language System
(UMLS) [Lindberg et al., 1993], which includes medical
concepts, definitions, relations, etc. The 2012 version of the
UMLS contains more than 2.7 million concepts from over
160 source vocabularies. Each concept is associated with a
unique keyword called CUI (Concept Unique Identifier), and
each CUI is associated with a term called preferred name.
The UMLS consists of a set of 133 subject categories, or
semantic types, that provide a consistent categorization of
all CUIs. The semantic types can be further grouped into 15
semantic groups. These semantic groups provide a partition of
the UMLS Metathesaurus for 99.5% of the concepts.

¹ http://www.nlm.nih.gov/bsd/pmresources.html

Domain specific parsers are required to accurately process
the medical text. The most well-known parsers in this area
include MetaMap [Aronson, 2001] and MedicalESG, an
adaptation of the English Slot Grammar parser [McCord et al.,
2012] to the medical domain. These tools can detect medical
entity mentions in a given sentence, and automatically
associate each term with a number of CUIs. Not all the CUIs
are actively used in the medical text. For example, only 150K
CUIs have been identified by MedicalESG in our corpus,
even though there are in total 2.7M CUIs in UMLS.

3.3 Medical Synonyms 

Synonymy is a semantic relation between two terms with very
similar meaning. However, it is extremely rare that two terms
have the exact same meaning. In this paper, our focus is to
identify near-synonyms, i.e., two terms that are
interchangeable in some contexts [Saeed, 2009].

The UMLS 2012 Release contains more than 600 relations
and 50M relation instances under 15 categories. Each
category covers a number of relations, and each relation has a
certain number of CUI pairs that are known to bear that
relation. From UMLS relations, we manually choose a subset
of them that are directly related to synonyms, and summarize
them in Table 1. In Table 2, we list several synonym examples
provided by these relations.


Table 1: UMLS Relations Corresponding to the Synonym
Relation, where “RO” stands for “has relationship other than
synonymous, narrower, or broader”, “RQ” stands for “related
and possibly synonymous”, and “SY” stands for “source
asserted synonymy”.

Category    Relation Attribute
RO          has_active_ingredient
RO          has_ingredient
RO          has_product_component
RO          has_tradename
RO          refers_to
RQ          has_alias
RQ          replaces
SY          -
SY          expanded_form_of
SY          has_expanded_form
SY          has_multi_level_category
SY          has_print_name
SY          has_single_level_category
SY          has_tradename
SY          same_as



Table 2: Several Synonym Examples in UMLS.

term 1                       term 2
Certolizumab pegol           Cimzia
Child sex abuse (finding)    Child Sexual Abuse
Mass of body structure       Space-occupying mass
Multivitamin tablet          Vitamin A
Blood in stool               Hematochezia
Keep Alert                   Caffeine


4 Medical Synonym Extraction with Concept 
Space Models 

In this section, we first project the terms to a new vector
space, resulting in a vector representation for each term
(Section 4.1). This is done by adding the semantic type and
semantic group knowledge from UMLS as extra labels to the
Word2Vec model training process. In the second step
(Section 4.2), we expand the resulting term vectors with extra
features to model the relationship between two terms. All
these features together form a concept space and are used
with a linear classifier for medical synonym extraction.

4.1 Semi-Supervised Term Embedding 
High Level Idea 

A key point of the Word2Vec model is to generate a “desired
word” for each term or term sequence from the context, and
then learn a shallow neural network to predict this “term →
desired word” or “term sequence → desired word” mapping.
The “desired word” generation process does not require any
supervised information, so it can be applied to any domain.

In the medical domain, a huge amount of true label
information in the format of “semantic type”, “semantic
group”, etc. has already been manually extracted and
integrated in knowledgebases like UMLS. Our semi-supervised
approach is to integrate this type of “true label” and the
“desired word” in the neural network training process to
produce a better vector representation for each term. This
idea is illustrated in Figure 1.

The Training Algorithm 

The training algorithm used in the Word2Vec model first 
learns how to update the weights on the network edges assum¬ 
ing the word vectors are given (randomly initialized); then it 
learns how to update the word vectors assuming the weights 
on the edges are given (randomly initialized). It repeats these 
two steps until all the words in the input corpus have been 
processed. 

We produce a “true label” vector, and add it as an extra set
of output nodes to the neural network. This vector has 148
entries corresponding to 133 medical semantic types and 15
semantic groups. If a term is associated with some semantic
types and semantic groups, the corresponding entries will
be set to 1, otherwise 0. Increasing the size of the output
layer will slow down the computation. For example, compared
to the original Word2Vec “Negative Sampling” strategy
with 150 negative samples, the semi-supervised method will
be about 50% slower.
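As a rough illustration of the “true label” vector described above, the sketch below builds the 148-entry multi-hot vector for one term. The 0-based integer indices for semantic types and groups and the function name `true_label_vector` are assumptions made for this example; the paper does not specify how types and groups are indexed.

```python
NUM_SEMANTIC_TYPES = 133   # UMLS semantic types
NUM_SEMANTIC_GROUPS = 15   # UMLS semantic groups


def true_label_vector(type_ids, group_ids):
    """Build a 148-entry multi-hot label vector for one term.

    type_ids and group_ids are hypothetical 0-based indices; a term
    with no known labels passes two empty lists and gets all zeros.
    """
    vec = [0] * (NUM_SEMANTIC_TYPES + NUM_SEMANTIC_GROUPS)
    for t in type_ids:
        if not 0 <= t < NUM_SEMANTIC_TYPES:
            raise ValueError(f"semantic type index out of range: {t}")
        vec[t] = 1
    for g in group_ids:
        if not 0 <= g < NUM_SEMANTIC_GROUPS:
            raise ValueError(f"semantic group index out of range: {g}")
        # Group entries follow the 133 type entries.
        vec[NUM_SEMANTIC_TYPES + g] = 1
    return vec
```

A term with several semantic types simply has several entries set to 1, which matches the statement that a word may carry multiple labels.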



Figure 1: An Illustration of the Semi-supervised Word
Embedding Model.

This section derives new update rules for word vectors and
edge weights. Since the semi-supervised approach is an
extension of the original Word2Vec model, the derivation is
also similar. The new update rules work for all 4 Word2Vec
settings mentioned in Section 2. The notations used in this
section are defined in Figure 2.
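Before going through the derivation, a small numerical sketch of the resulting update step may be helpful. It applies update rules (1) and (2) of this section to one context vector, treating the “desired word” and any known labels as output nodes whose target is 1 and every other supplied output node as having target 0. The dictionary-based data layout, the function names, and the choice to accumulate the context-vector update over all output nodes before applying it are assumptions made for this illustration, not details confirmed by the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def update_step(c, thetas, targets, eta=0.025):
    """One gradient-ascent step for a context vector c.

    thetas:  dict mapping each output node u to its weight list theta_u
             (updated in place, rule (1))
    targets: set of output nodes u whose indicator pi(u) equals 1,
             i.e. the desired word and any known labels
    Returns the updated context vector (rule (2)).
    """
    delta_c = [0.0] * len(c)
    for u, theta in thetas.items():
        pi = 1.0 if u in targets else 0.0
        score = sum(t * x for t, x in zip(theta, c))
        g = eta * (pi - sigmoid(score))
        for k in range(len(c)):
            # Accumulate the change to c using theta before it is updated.
            delta_c[k] += g * theta[k]
            theta[k] += g * c[k]
    return [x + d for x, d in zip(c, delta_c)]
```

With all weights at zero, the sigmoid outputs 0.5, so a target node is pushed towards the context vector and a non-target node is pushed away from it by the same amount.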


$\delta(x) = 1/(1 + e^{-x})$;

$V$ is a vocabulary set with $|V|$ distinct words;

$W = \{w_1, \cdots, w_{|V|}\}$ is the set of embeddings for all the
words in $V$;

$w_k$ is the embedding for word $k$ in $V$;

$\theta_k$ is a mapping function associated with word $k$;

$X = \{x_1, \cdots, x_N\}$ represents all the words in a corpus
$\mathcal{X}$;

$x_i$ is the $i$-th word in $\mathcal{X}$;

$c_i = [w_{c_{i1}}, \cdots, w_{c_{il}}]$ represents the embeddings of all $l$
context words for word $x_i$ in $\mathcal{X}$;

$L$ is a label set with $|L|$ distinct labels;

$X_U$ represents all the words without known labels in $\mathcal{X}$;

$X_L$ represents all the words with known labels in $\mathcal{X}$;

$X = \{X_L, X_U\}$;

$Y_i$ is a set of known labels associated with $x_i$. Note
that $x_i$ may have multiple labels;

$Y_i = \{\}$, when $x_i$'s label is unknown;

$\eta$ is the learning rate;


Figure 2: Notations

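It can be verified that the following formulas will always
hold:

$$d(\delta(x))/dx = \delta(x) \cdot (1 - \delta(x))$$
$$d(\log \delta(x))/dx = 1 - \delta(x)$$
$$d(\log(1 - \delta(x)))/dx = -\delta(x)$$

$\pi_S(x)$ is defined as

$$\pi_S(x) = \begin{cases} 1, & x \in S; \\ 0, & \text{otherwise}; \end{cases}$$

Given a word $x_i$, we first compute the context vector $c_i$ by
concatenating the embeddings of all the contextual words of
$x_i$. Then we simultaneously maximize the probability to
predict its “desired word” and the “true label” (if known).

$$f(u|c_i) = \begin{cases} \delta(\theta_u' c_i), & \pi_{\{x_i, Y_i\}}(u) = 1 \\ 1 - \delta(\theta_u' c_i), & \pi_{\{x_i, Y_i\}}(u) = 0 \end{cases}$$

$f(u|c_i)$ can also be written as

$$f(u|c_i) = \delta(\theta_u' c_i)^{\pi_{\{x_i, Y_i\}}(u)} \cdot (1 - \delta(\theta_u' c_i))^{1 - \pi_{\{x_i, Y_i\}}(u)}$$

To make the explanation simpler, we define $F(x_i, u)$, $F(x_i)$
and $F$ as follows:

$$F(x_i, u) = f(u|c_i), \quad F(x_i) = \prod_{u \in V} f(u|c_i), \quad F = \prod_{x_i \in X} F(x_i)$$

To achieve our goal, we want to maximize $F$ for the given
$X$:

$$F = \prod_{x_i \in X} \prod_{u \in V} f(u|c_i)$$

Applying log to both sides of the above equation, we have

$$\log F = \sum_{x_i \in X} \sum_{u \in V} \log f(u|c_i)$$
$$= \sum_{x_i \in X} \sum_{u \in V} \log\left[\delta(\theta_u' c_i)^{\pi_{\{x_i, Y_i\}}(u)} \cdot (1 - \delta(\theta_u' c_i))^{1 - \pi_{\{x_i, Y_i\}}(u)}\right]$$
$$= \sum_{x_i \in X} \sum_{u \in V} \pi_{\{x_i, Y_i\}}(u) \cdot \log \delta(\theta_u' c_i) + (1 - \pi_{\{x_i, Y_i\}}(u)) \cdot \log(1 - \delta(\theta_u' c_i))$$

Now we compute the first derivative of $\log F(x_i, u)$ on both
$\theta_u$ and $c_i$:

$$\partial \log F(x_i, u)/\partial \theta_u$$
$$= \pi_{\{x_i, Y_i\}}(u) \cdot (1 - \delta(\theta_u' c_i)) \cdot c_i - (1 - \pi_{\{x_i, Y_i\}}(u)) \cdot \delta(\theta_u' c_i) \cdot c_i$$
$$= (\pi_{\{x_i, Y_i\}}(u) - \delta(\theta_u' c_i)) \cdot c_i$$

$$\partial \log F(x_i, u)/\partial c_i$$
$$= (\pi_{\{x_i, Y_i\}}(u) - \delta(\theta_u' c_i)) \cdot \theta_u$$

The new update rules to update $\theta_u$ (weights on the edges) and
$c_i$ (word vectors) are as follows:

$$\theta_u = \theta_u + \eta \cdot (\pi_{\{x_i, Y_i\}}(u) - \delta(\theta_u' c_i)) \cdot c_i \quad (1)$$

$$c_i = c_i + \eta \cdot (\pi_{\{x_i, Y_i\}}(u) - \delta(\theta_u' c_i)) \cdot \theta_u \quad (2)$$

4.2 Expansion of the Raw Features

The semi-supervised term embedding model returns a vector
representation for each term. To capture more information
useful for synonym extraction, we expand the raw features
with several heuristic rule-based matching features and
several other simple feature expansions. All these expanded
features together with the raw features form a concept
space for synonym extraction.

Notations used in this section are summarized here: for any
pair of the input terms a and b, we represent their lengths
(number of words) as |a| and |b|. a and b can be multi-word
terms (like “United States”) or single-word terms (like
“student”). We represent the raw feature vector of a as A, and
the raw feature vector of b as B.
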

Rule-based Matching Features 

The raw features can only help model the distributional
similarity of two terms based on the corpus and the existing
medical knowledge. In this section, we provide several
matching features to model the similarity of two terms based
on the surface form. These rule-based matching features m1 to
m6 are generated as follows:

• m1: returns the number of the common words shared by
a and b.

• m2: m1/(|a| · |b|);

• m3: if a and b only differ by an antonym prefix, returns
1; otherwise, 0. The antonym prefix list includes
character sequences like “anti”, “dis”, “il”, “im”, “in”, “ir”,
“non” and “un”, etc. For example, m3 = 1 for “like”
and “dislike”.

• m4: if all the upper case characters from a and b match
each other, returns 1; otherwise, 0. For example, m4 = 1
for “USA” and “United States of America”.

• m5: if all the first characters in each word from a and b
match each other, returns 1; otherwise, 0. For example,
m5 = 1 for “hs” and “hierarchical softmax”.

• m6: if one term is the subsequence of another term,
returns 1; otherwise, 0.
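The matching rules above leave some details open, so the following Python sketch shows one possible reading rather than the exact implementation used in the experiments. It assumes case-insensitive word comparison for m1, m2 and m6, reads m5 as “the initials of one term spell out the other term” (which is what the “hs” example suggests), and reads “subsequence” in m6 as a contiguous run of words. The prefix list is limited to the examples given in the text, and all function names are hypothetical.

```python
ANTONYM_PREFIXES = ("anti", "dis", "il", "im", "in", "ir", "non", "un")


def _initials(words):
    return "".join(w[0] for w in words)


def _contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run of words in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))


def matching_features(a, b):
    """Return [m1, ..., m6] for two non-empty terms a and b."""
    la, lb = a.lower(), b.lower()
    wa, wb = la.split(), lb.split()
    # m1: number of common words; m2: m1 normalized by both lengths.
    m1 = len(set(wa) & set(wb))
    m2 = m1 / (len(wa) * len(wb))
    # m3: the terms differ only by an antonym prefix.
    m3 = int(any(la == p + lb or lb == p + la for p in ANTONYM_PREFIXES))
    # m4: the upper case characters of both terms are identical.
    upper_a = "".join(ch for ch in a if ch.isupper())
    upper_b = "".join(ch for ch in b if ch.isupper())
    m4 = int(upper_a != "" and upper_a == upper_b)
    # m5: the word initials of one term spell the other term.
    m5 = int(_initials(wa) == lb.replace(" ", "")
             or _initials(wb) == la.replace(" ", ""))
    # m6: one term is a contiguous word sequence of the other.
    m6 = int(_contains(wa, wb) or _contains(wb, wa))
    return [m1, m2, m3, m4, m5, m6]
```

Under these assumptions the three examples in the list (“like”/“dislike”, “USA”/“United States of America”, and “hs”/“hierarchical softmax”) set m3, m4 and m5 to 1 respectively.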

Feature Expansions 

In addition to the matching features, we also produce several
other straightforward expansions, including

• “sum”: [A + B]

• “difference”: [|A − B|]

• “product”: [A · B]

• [m2 · A, m2 · B]
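Putting the pieces of this section together, the sketch below assembles a concept-space feature vector for one term pair from the raw embeddings A and B and the matching features m1 to m6. The ordering of the feature blocks, the element-wise reading of the “product” feature, and the function name `concept_space_vector` are assumptions made for illustration.

```python
def concept_space_vector(A, B, m):
    """Concatenate raw, matching and expanded features for one term pair.

    A, B: raw embedding vectors (lists of floats of equal length)
    m:    the six matching features [m1, ..., m6]
    """
    if len(A) != len(B):
        raise ValueError("A and B must have the same dimension")
    m2 = m[1]
    total = [x + y for x, y in zip(A, B)]            # "sum"
    difference = [abs(x - y) for x, y in zip(A, B)]  # "difference"
    product = [x * y for x, y in zip(A, B)]          # "product" (element-wise, assumed)
    scaled = [m2 * x for x in A] + [m2 * y for y in B]
    return list(A) + list(B) + list(m) + total + difference + product + scaled
```

Because every block is a simple function of A, B and the matching features, the vector can be computed on the fly for each candidate pair without storing anything beyond the term embeddings.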



5 Experiments 

In our experiments, we used the MedicalESG parser [McCord
et al., 2012] to parse all 130M sentences in the corpus, and
extracted all the terms from the sentences. For those terms
that are associated with CUIs, we assigned each of them a set
of semantic types/groups through a CUI lookup in UMLS.

Section 5.1 presents how the training and test sets were
created. Section 5.2 describes the baseline approaches used in
our experiments. In Section 5.3, we compare the new models
and the baseline approaches on the medical synonym
extraction task. To measure the scalability, we also used our
best synonym extractor to build a new medical synonym
knowledgebase. Then we analyze the contribution of each
individual feature to the final results in Section 5.4.

5.1 Data Collection 

UMLS has about 300K CUI pairs under the relations (see
Table 1) corresponding to synonyms. However, the majority of
them contain CUIs that are not in our corpus. The goal
of this paper is to detect new synonyms from text, so only the
relation instances with both CUIs in the corpus are used in
the experiments. A CUI starts with the letter ‘C’, followed by
7 digits. We use the preferred name, which is a term, as the
surface form for each CUI. Our preprocessing step resulted in
a set of 8,000 positive synonym examples. To resemble the
real-world challenges, where most of the given term pairs are
non-synonyms, we randomly generated more than 1.6M term
pairs as the negative examples. For these negative examples,
both terms are required to occur at least twice in our corpus.
Some negative examples generated in this way may in fact be
positives, but this should be very rare.

The final dataset was split into 3 parts: 60% of the examples
were used for training, 20% were used for testing the
classifiers, and the remaining 20% were held out to evaluate
the knowledgebase construction results.
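The data preparation described in this section can be sketched roughly as follows. This is a simplified illustration, not the pipeline used in the experiments: the function names, the fixed random seed, and the explicit filtering of known positive pairs out of the random negatives are all assumptions, and `terms` is assumed to already contain only terms that occur at least twice in the corpus.

```python
import random


def build_dataset(positive_pairs, terms, num_negatives, seed=0):
    """Label known synonym pairs 1 and randomly drawn term pairs 0.

    num_negatives must be small relative to the number of possible
    pairs, otherwise the sampling loop may not terminate.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    negatives = set()
    while len(negatives) < num_negatives:
        a, b = rng.sample(terms, 2)
        # Assumed filter: skip pairs already known to be synonyms.
        if (a, b) not in positives and (b, a) not in positives:
            negatives.add((a, b))
    examples = [(p, 1) for p in sorted(positives)]
    examples += [(n, 0) for n in sorted(negatives)]
    rng.shuffle(examples)
    return examples


def split_60_20_20(examples):
    """Split into training, test and held-out parts (60%/20%/20%)."""
    n_train = int(0.6 * len(examples))
    n_test = int(0.2 * len(examples))
    return (examples[:n_train],
            examples[n_train:n_train + n_test],
            examples[n_train + n_test:])
```

Random pairs that happen to be true but unlisted synonyms would still be labeled 0 here, which corresponds to the rare label noise acknowledged above.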

5.2 Baseline Approaches 

Both the LSA model [Deerwester et al., 1990] and the
Word2Vec model [Mikolov et al., 2013] were built on our
medical corpus as discussed in Section 3.1, which has
about 130M sentences and 15M unique terms. We
constructed Word2Vec as well as the semi-supervised term
embedding models under all 4 different settings: HS+CBOW,
HS+SkipGram, NEG+CBOW, and NEG+SkipGram. The
parameters used in the experiments were: dimension_size=100,
window_size=5, negative=10, and sample_rate=1e-5. We
obtained 100 dimensional embeddings from all these models.
Word2Vec models typically took a couple of hours to train,
while the LSA model required a whole day of training on a
computer with 16 cores and 128G memory. The feature
expansion (Section 4.2) was applied to all these baseline
approaches as well.

The letter n-gram model [Huang et al., 2013] used in our
experiments was slightly different from the original one. We
added special letter n-grams on top of the original one to
model the beginning and end of each term. In our letter n-gram
experiments, we tested n = 2, 3 and 4. The letter n-gram
model does not require training.
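A minimal sketch of letter n-gram extraction with boundary markers is given below. The marker characters “^” and “$”, the lowercasing step, and the function name are assumptions made for illustration; the text does not state which symbols were used to mark the beginning and end of a term.

```python
def letter_ngrams(term, n, begin="^", end="$"):
    """Return the letter n-grams of a term, including boundary n-grams.

    The boundary markers are hypothetical; any characters that do not
    occur in the terms themselves would serve the same purpose.
    """
    padded = begin + term.lower() + end
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


trigrams = letter_ngrams("flu", 3)   # ['^fl', 'flu', 'lu$']
```

Since the features are read directly off the surface form, no embedding model has to be trained, but the number of distinct n-grams grows quickly with n.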


5.3 Synonym Extraction Results 

The focus of this paper is to find a good concept space for
synonym extraction, so we prefer a simple classifier over a
complicated one in order to more directly measure the impact
of the features (in the concept space) on the performance.
Speed is one of our major concerns, and we have about 1M
training examples to process, so we used the liblinear
package [Fan et al., 2008] in our experiments for its high
speed and good scalability. In all the experiments, the weight
for the positive examples was set to 100, due to the fact that
most of the input examples were negative. All the other
parameters were set to the default values. The evaluation of
different approaches is based on the F1 scores, and the final
results are summarized in Table 3.
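With roughly 8,000 positives against more than 1.6M negatives, a classifier that labels every pair as a non-synonym would be over 99% accurate, which is why F1 on the positive class is the more informative measure. The sketch below computes it from gold and predicted labels; the function name and the 0/1 label encoding are assumptions for this example.

```python
def f1_score(gold, predicted):
    """F1 of the positive class (label 1) for two equal-length label lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

An all-negative prediction gets an F1 of 0 under this definition, regardless of how many negatives it classifies correctly.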


Table 3: F1 Scores of All Approaches on both the Training
and the Test Data.

Approach                    Training F1    Test F1
LSA                         48.98%         48.13%
Letter-BiGram               55.51%         50.09%
Letter-TriGram              96.30%         63.37%
Letter-FourGram             99.48%         66.50%
Word2Vec HS+CBOW            64.05%         63.51%
Word2Vec HS+SKIP            64.23%         62.65%
Word2Vec NEG+CBOW           59.47%         58.75%
Word2Vec NEG+SKIP           70.17%         67.86%
Concept Space HS+CBOW       68.15%         65.73%
Concept Space HS+SKIP       74.25%         70.74%
Concept Space NEG+CBOW      63.57%         60.90%
Concept Space NEG+SKIP      74.09%         70.97%


From the results in Table 3, we can see that the LSA model
returns the lowest F1 score of 48.13%, followed by the letter
bigram model (50.09%). The letter trigram and 4-gram models
return very competitive scores as high as 66.50%, and it
looks like increasing the value of n in the letter n-gram model
will push the F1 score up even further. However, we have to
stop at n = 4 for two reasons. Firstly, when n is larger, the
model is more likely to overfit the training data, and the
F1 score for the letter 4-gram model on the training data is
already 99.48%. Secondly, the resulting model will be more
complicated for n > 4. For the letter 4-gram model, the
model file itself is already about 3G on disk, making
it very expensive to use.

The best setting of the Word2Vec model returns a 67.86%
F1 score. This is about 3% lower than the best setting of the
concept space model, which achieves the best F1 score across
all approaches: 70.97%. We also compare the Word2Vec
model and the concept space model under all 4 different
settings in Table 4. The concept space model outperforms the
original Word2Vec model by a large margin (3.9% on
average) under all of them.

Since we have a lot of training data, and are using a linear
model as the classifier, the training part is very stable. We
ran a couple of other experiments (not reported in this paper)
by expanding the training set and the test set with the dataset
held out for knowledgebase evaluation, but did not see much
difference in terms of the F1 scores.




Table 5: Analysis of the Feature Contributions to F1 for the Concept Space Model.

F1                         HS+CBOW   HS+SKIP   NEG+CBOW   NEG+SKIP   Average Score   Average Improvement
Raw features [A, B]        44.28%    51.19%    41.42%     53.37%     47.57%          -
+ [matching features]      57.81%    63.53%    54.63%     62.37%     59.59%          +12.02%
+ [A + B] and [|A − B|]    62.71%    70.37%    61.04%     69.02%     65.79%          +6.20%
+ [A · B]                  66.20%    68.61%    60.40%     70.18%     66.35%          +0.56%
+ [m2 · A, m2 · B]         65.73%    70.74%    60.90%     70.97%     67.09%          +0.74%


Table 4: F1 Scores of the Word2Vec and the Concept Space
Models on the Test Data.

Setting      Word2Vec   Concept Space   Diff
HS+CBOW      63.51%     65.73%          +2.22%
HS+SKIP      62.65%     70.74%          +8.09%
NEG+CBOW     58.75%     60.90%          +2.15%
NEG+SKIP     67.86%     70.97%          +3.11%
Average      63.19%     67.09%          +3.90%


Our method was very scalable. It took on average several
hours to generate the word embedding file from our medical
corpus with 20G text using 16 × 3.2 GHz CPUs and roughly 30
minutes to finish the training process using one CPU. To
measure the scalability at apply time, we constructed a new
medical synonym knowledgebase with our best synonym
extractor. This was done by applying the concept space model
trained under the NEG+SKIP setting to a set of 11B pairs of
terms. All these terms are associated with CUIs, and occur at
least twice in our medical corpus. This KB construction
process finished in less than 10 hours using one CPU,
resulting in more than 3M medical synonym term pairs. To
evaluate the recall of this knowledgebase, we checked each
term pair in the held out synonym dataset against this KB, and
found that more than 42% of them were covered by this new
KB. Precision evaluation of this KB requires a lot of manual
annotation effort, and will be included in our future work.

5.4 Feature Contribution Analysis 

In Section 4.2, we expand the raw features with the matching
features and several other feature expansions to model the
term relationships. In this section, we study the contribution
of each individual feature to the final results. We added all
those expanded features to the raw features one by one and
re-ran the experiments for the concept space model. The
results and the feature contributions are summarized in
Table 5.

The results show that adding matching features, “sum” and
“difference” features can significantly improve the F1 scores.
We can also see that adding the last two feature sets does not
seem to contribute a lot to the average F1 score. However,
they do contribute significantly to our best F1 score, by about
2%.

6 Conclusions 

In this paper, we present an approach to construct a medical
concept space from manually extracted medical knowledge
and a large corpus with 20G unstructured text. Our approach
extends the Word2Vec model by making use of the medical
knowledge as extra label information during the training
process. This new approach fits well for the medical domain,
where the language use variability is exceptionally high and
the existing knowledge is also abundant.

Experiment results show that the proposed model
outperforms the baseline approaches by a large margin on a
dataset with more than one million term pairs. Future work
includes doing a precision analysis of the resulting synonym
knowledgebase, and exploring how deep learning models can be
combined with our concept space model for better synonym
extraction.